We visually analyze the given Wine Quality Dataset to examine how features such as fixed acidity and volatile acidity correlate with quality, and then classify wines into quality score categories. Three ML models are trained across two experiments, and we determine which model performs best.
Two experiments were conducted: one predicting quality from all features, and one predicting quality using only the features strongly correlated with quality (those whose absolute correlation exceeds 0.2, identified through the visual analysis). All experiments were run in the environment below.
# Install and load the required packages
# install.packages("dplyr")
# install.packages("readr")
# install.packages("ggplot2")
# install.packages("corrplot")
# install.packages("e1071")
# install.packages("randomForest")
# install.packages("rpart")
# install.packages("caret")
# install.packages("yardstick")
# install.packages("MLmetrics")
# install.packages("MASS")
library(dplyr)
library(readr)
library(ggplot2)
library(corrplot)
library(e1071)
library(randomForest)
library(rpart)
library(caret)
library(yardstick)
library(MLmetrics)
library(MASS)
Attaching package: ‘dplyr’
The following objects are masked from ‘package:stats’:
filter, lag
The following objects are masked from ‘package:base’:
intersect, setdiff, setequal, union
corrplot 0.92 loaded
randomForest 4.7-1.1
Type rfNews() to see new features/changes/bug fixes.
Attaching package: ‘randomForest’
The following object is masked from ‘package:ggplot2’:
margin
The following object is masked from ‘package:dplyr’:
combine
Loading required package: lattice
Attaching package: ‘yardstick’
The following objects are masked from ‘package:caret’:
precision, recall, sensitivity, specificity
The following object is masked from ‘package:readr’:
spec
Attaching package: ‘MLmetrics’
The following objects are masked from ‘package:caret’:
MAE, RMSE
The following object is masked from ‘package:base’:
Recall
Attaching package: ‘MASS’
The following object is masked from ‘package:dplyr’:
select
# Read the red wine data
wine_red <- read_csv('/Users/kimsongha/Projects/2023-2/WineQT.csv', col_types = cols())
# Check for missing values across the whole data frame
sum(is.na(wine_red))
head(wine_red)
| fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality | Id |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> |
| 7.4 | 0.70 | 0.00 | 1.9 | 0.076 | 11 | 34 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 | 0 |
| 7.8 | 0.88 | 0.00 | 2.6 | 0.098 | 25 | 67 | 0.9968 | 3.20 | 0.68 | 9.8 | 5 | 1 |
| 7.8 | 0.76 | 0.04 | 2.3 | 0.092 | 15 | 54 | 0.9970 | 3.26 | 0.65 | 9.8 | 5 | 2 |
| 11.2 | 0.28 | 0.56 | 1.9 | 0.075 | 17 | 60 | 0.9980 | 3.16 | 0.58 | 9.8 | 6 | 3 |
| 7.4 | 0.70 | 0.00 | 1.9 | 0.076 | 11 | 34 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 | 4 |
| 7.4 | 0.66 | 0.00 | 1.8 | 0.075 | 13 | 40 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 | 5 |
names(wine_red)
names(wine_red) <- gsub(" ", "_", names(wine_red))
names(wine_red)
# Define a helper that prints basic data frame information
info <- function(df) {
# Basic structure of the data frame
print(str(df))
# Type and missing-value count for each column
sapply(df, function(x) {
c(type = class(x), n_missing = sum(is.na(x)))
})
}
# Run the helper
info(wine_red)
spc_tbl_ [1,143 × 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ fixed_acidity       : num [1:1143] 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 6.7 ...
 $ volatile_acidity    : num [1:1143] 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.58 ...
 $ citric_acid         : num [1:1143] 0 0 0.04 0.56 0 0 0.06 0 0.02 0.08 ...
 $ residual_sugar      : num [1:1143] 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 1.8 ...
 $ chlorides           : num [1:1143] 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.097 ...
 $ free_sulfur_dioxide : num [1:1143] 11 25 15 17 11 13 15 15 9 15 ...
 $ total_sulfur_dioxide: num [1:1143] 34 67 54 60 34 40 59 21 18 65 ...
 $ density             : num [1:1143] 0.998 0.997 0.997 0.998 0.998 ...
 $ pH                  : num [1:1143] 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.28 ...
 $ sulphates           : num [1:1143] 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.54 ...
 $ alcohol             : num [1:1143] 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 9.2 ...
 $ quality             : num [1:1143] 5 5 5 6 5 5 5 7 7 5 ...
 $ Id                  : num [1:1143] 0 1 2 3 4 5 6 7 8 10 ...
 - attr(*, "spec")=
  .. cols(
  ..   `fixed acidity` = col_double(),
  ..   `volatile acidity` = col_double(),
  ..   `citric acid` = col_double(),
  ..   `residual sugar` = col_double(),
  ..   chlorides = col_double(),
  ..   `free sulfur dioxide` = col_double(),
  ..   `total sulfur dioxide` = col_double(),
  ..   density = col_double(),
  ..   pH = col_double(),
  ..   sulphates = col_double(),
  ..   alcohol = col_double(),
  ..   quality = col_double(),
  ..   Id = col_double()
  .. )
 - attr(*, "problems")=<externalptr>
NULL
| | fixed_acidity | volatile_acidity | citric_acid | residual_sugar | chlorides | free_sulfur_dioxide | total_sulfur_dioxide | density | pH | sulphates | alcohol | quality | Id |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| type | numeric | numeric | numeric | numeric | numeric | numeric | numeric | numeric | numeric | numeric | numeric | numeric | numeric |
| n_missing | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
# Remove outliers (Z-score based)
z_scores <- as.data.frame(scale(wine_red))
# Identify rows where every Z-score is within 3 standard deviations
filtered_entries <- apply(z_scores, 1, function(x) all(abs(x) < 3))
# Build a new data frame from the filtered rows
wine_new <- wine_red[filtered_entries, ]
wine_new_rows <- nrow(wine_new)
# Number of rows in the original data frame
wine_rows <- nrow(wine_red)
# Number of rows removed from the dataset
wine_reduction <- wine_rows - wine_new_rows
# Percentage of rows removed
wine_reduction_percent <- (wine_reduction / wine_rows) * 100
# Print the result
cat(wine_reduction, "outliers have been removed from the wine_red dataset, which represents", round(wine_reduction_percent, 2), "% of the original dataset.")
102 outliers have been removed from the wine_red dataset, which represents 8.92 % of the original dataset.
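The Z-score filter above can also be wrapped as a small reusable helper. This is a minimal sketch assuming the same 3-standard-deviation cutoff; `filter_outliers` is a hypothetical name, not part of the original analysis.

```r
# Hypothetical helper: drop rows where any numeric column's Z-score
# exceeds the threshold (3 standard deviations, as above)
filter_outliers <- function(df, z = 3) {
  zs <- abs(scale(df[sapply(df, is.numeric)]))   # standardize each numeric column
  df[apply(zs, 1, function(x) all(x < z)), , drop = FALSE]
}
```

Calling `filter_outliers(wine_red)` should reproduce the `wine_new` data frame built above.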
# Original dataset: count the wines at each quality level
wine_quality_counts <- table(wine_red$quality)
wine_quality_counts_sorted <- sort(wine_quality_counts)
# Print the sorted quality counts for the original dataset
cat("Original Dataset Quality Counts:\n")
print(wine_quality_counts_sorted)
cat("\nNumber of rows in original dataset:", nrow(wine_red), "\n")
cat("Number of columns in original dataset:", ncol(wine_red), "\n\n")
Original Dataset Quality Counts:

  3   8   4   7   6   5 
  6  16  33 143 462 483 

Number of rows in original dataset: 1143 
Number of columns in original dataset: 13 
# Filtered dataset: repeat the same steps for the filtered data (wine_new)
wine_new_quality_counts <- table(wine_new$quality)
wine_new_quality_counts_sorted <- sort(wine_new_quality_counts)
# Print the sorted quality counts for the filtered dataset
cat("Filtered Dataset Quality Counts:\n")
print(wine_new_quality_counts_sorted)
cat("\nNumber of rows in filtered dataset:", nrow(wine_new), "\n")
cat("Number of columns in filtered dataset:", ncol(wine_new), "\n")
Filtered Dataset Quality Counts:

  8   4   7   6   5 
 15  30 131 425 440 

Number of rows in filtered dataset: 1041 
Number of columns in filtered dataset: 13 
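The counts above show a strong class imbalance: qualities 5 and 6 dominate while 4 and 8 are rare, which matters when interpreting the per-class metrics later. As a quick sketch for quantifying this, the counts can be expressed as percentages (`quality_props` is a hypothetical helper, not from the original analysis):

```r
# Hypothetical helper: share of observations in each quality class, in percent
quality_props <- function(qual) {
  round(prop.table(table(qual)) * 100, 1)
}
```

For example, `quality_props(wine_new$quality)` would show qualities 5 and 6 together covering roughly 80% of the filtered data.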
# Compute the correlation matrix
correlation_matrix <- cor(wine_new[, sapply(wine_new, is.numeric)])
print(correlation_matrix)
fixed_acidity volatile_acidity citric_acid residual_sugar
fixed_acidity 1.0000000 -0.29604463 0.700352188 0.20696874
volatile_acidity -0.2960446 1.00000000 -0.586476504 0.01226155
citric_acid 0.7003522 -0.58647650 1.000000000 0.21571330
residual_sugar 0.2069687 0.01226155 0.215713301 1.00000000
chlorides 0.1961650 0.08279813 0.107235020 0.09345136
free_sulfur_dioxide -0.1577831 0.01778468 -0.074015963 -0.01924237
total_sulfur_dioxide -0.1023435 0.11794992 -0.006489414 0.06346721
density 0.6786199 -0.01950458 0.388885280 0.34321986
pH -0.7019710 0.23564764 -0.509549151 -0.07876993
sulphates 0.1961744 -0.36396298 0.307067099 0.05573806
alcohol -0.0235932 -0.20835754 0.168178046 0.19384165
quality 0.1443043 -0.37350741 0.266761637 0.07724509
Id -0.2853926 -0.02949566 -0.133015928 -0.09517138
chlorides free_sulfur_dioxide total_sulfur_dioxide
fixed_acidity 0.196164985 -0.15778312 -0.102343458
volatile_acidity 0.082798128 0.01778468 0.117949922
citric_acid 0.107235020 -0.07401596 -0.006489414
residual_sugar 0.093451361 -0.01924237 0.063467207
chlorides 1.000000000 -0.04254594 0.057203637
free_sulfur_dioxide -0.042545937 1.00000000 0.655242716
total_sulfur_dioxide 0.057203637 0.65524272 1.000000000
density 0.329171856 -0.06592196 0.099270433
pH -0.185002111 0.10598149 0.006074196
sulphates 0.004713258 0.01746152 -0.071416096
alcohol -0.234142974 -0.06285188 -0.251442717
quality -0.129100616 -0.07736371 -0.236611878
Id -0.144497253 0.08832727 -0.132665744
density pH sulphates alcohol
fixed_acidity 0.67861989 -0.701970987 0.196174438 -0.02359320
volatile_acidity -0.01950458 0.235647635 -0.363962980 -0.20835754
citric_acid 0.38888528 -0.509549151 0.307067099 0.16817805
residual_sugar 0.34321986 -0.078769926 0.055738062 0.19384165
chlorides 0.32917186 -0.185002111 0.004713258 -0.23414297
free_sulfur_dioxide -0.06592196 0.105981491 0.017461522 -0.06285188
total_sulfur_dioxide 0.09927043 0.006074196 -0.071416096 -0.25144272
density 1.00000000 -0.317353168 0.117556809 -0.45278475
pH -0.31735317 1.000000000 -0.027227672 0.16022233
sulphates 0.11755681 -0.027227672 1.000000000 0.24752283
alcohol -0.45278475 0.160222329 0.247522831 1.00000000
quality -0.15715430 -0.075023033 0.401926032 0.50904036
Id -0.41551133 0.130624358 -0.054309206 0.27235851
quality Id
fixed_acidity 0.14430434 -0.28539256
volatile_acidity -0.37350741 -0.02949566
citric_acid 0.26676164 -0.13301593
residual_sugar 0.07724509 -0.09517138
chlorides -0.12910062 -0.14449725
free_sulfur_dioxide -0.07736371 0.08832727
total_sulfur_dioxide -0.23661188 -0.13266574
density -0.15715430 -0.41551133
pH -0.07502303 0.13062436
sulphates 0.40192603 -0.05430921
alcohol 0.50904036 0.27235851
quality 1.00000000 0.09022289
Id 0.09022289 1.00000000
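The second experiment keeps only the features whose absolute correlation with quality exceeds 0.2. A minimal sketch of that selection rule, reading directly off the matrix above (`select_correlated` is a hypothetical name, not part of the original analysis):

```r
# Hypothetical helper: names of the numeric columns whose absolute
# Pearson correlation with the target exceeds the threshold
select_correlated <- function(df, target = "quality", threshold = 0.2) {
  cors <- cor(df[sapply(df, is.numeric)])[, target]
  cors <- cors[names(cors) != target]   # drop the target itself
  names(cors)[abs(cors) > threshold]
}
```

On `wine_new` this would pick out volatile_acidity, citric_acid, total_sulfur_dioxide, sulphates, and alcohol, matching the |r| > 0.2 entries in the quality column of the matrix above.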
library(reshape2)
# Convert the correlation matrix to long format
correlation_melted <- melt(correlation_matrix)
# Create a heatmap with ggplot
options(repr.plot.width=10, repr.plot.height=10)
ggplot(correlation_melted, aes(Var1, Var2, fill = value)) +
geom_tile() +
scale_fill_gradient2(low = "dodgerblue3", high = "red", mid = "white",
midpoint = 0, limit = c(-1, 1), space = "Lab",
name="Pearson\nCorrelation") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1),
plot.margin = unit(c(1, 1, 4, 1), "cm")) + # adjust plot margins
labs(x = '', y = '')
# Build a color vector by quality
colors <- ifelse(wine_new$quality > 5, "palevioletred", "skyblue")
# Pairwise scatter plots
options(repr.plot.width=15, repr.plot.height=15)
pairs(wine_new[,c('quality', 'fixed_acidity', 'volatile_acidity', 'citric_acid', 'residual_sugar', 'chlorides', 'free_sulfur_dioxide', 'total_sulfur_dioxide', 'density', 'pH', 'sulphates', 'alcohol')],
main = "Pairs Plot of Red Wine Data",
bg = colors, pch = 21)
# Additional visualizations
# Alcohol content by quality
options(repr.plot.width=10, repr.plot.height=10)
ggplot(wine_new, aes(x=factor(quality), y=alcohol, fill=factor(quality))) +
geom_boxplot() +
labs(title="Alcohol Content by Quality", x="Quality", y="Alcohol Content") +
theme_minimal()
# pH level by quality
options(repr.plot.width=10, repr.plot.height=10)
ggplot(wine_new, aes(x=factor(quality), y=pH, fill=factor(quality))) +
geom_boxplot() +
labs(title="pH Level by Quality", x="Quality", y="pH Level") +
theme_minimal()
# Train/test data split
X <- wine_new[, !(names(wine_new) %in% 'quality')]
y <- wine_new$quality
# Create training and test datasets
set.seed(123) # Setting a seed for reproducibility
splitIndex <- createDataPartition(y, p = 0.70, list = FALSE)
X_train <- X[splitIndex,]
y_train <- y[splitIndex]
X_test <- X[-splitIndex,]
y_test <- y[-splitIndex]
print(dim(wine_new))
print(dim(X_train))
print(dim(X_test))
print(length(y_train))
print(length(y_test))
[1] 1041   13
[1] 730  12
[1] 311  12
[1] 730
[1] 311
# SVM model
y_train <- as.factor(y_train)
model <- svm(y_train ~ ., data = X_train, probability = TRUE)
print(summary(model))
Call:
svm(formula = y_train ~ ., data = X_train, probability = TRUE)
Parameters:
SVM-Type: C-classification
SVM-Kernel: radial
cost: 1
Number of Support Vectors: 634
( 239 93 271 22 9 )
Number of Classes: 5
Levels:
4 5 6 7 8
# Convert factor predictions back to numeric values
predictions <- predict(model, X_test)
predictions_numeric <- as.numeric(levels(predictions))[predictions]
# Round to whole quality scores
predictions_f <- as.factor(round(predictions_numeric))
# Confusion matrix
y_test_f <- as.factor(y_test)
cm <- confusionMatrix(predictions_f, y_test_f)
print(cm)
Warning message in levels(reference) != levels(data):
“longer object length is not a multiple of shorter object length”
Warning message in confusionMatrix.default(predictions_f, y_test_f):
“Levels are not in the same order for reference and data. Refactoring data to match.”
Confusion Matrix and Statistics
Reference
Prediction 4 5 6 7 8
4 0 0 0 0 0
5 6 94 33 2 0
6 2 36 84 23 4
7 0 3 10 12 2
8 0 0 0 0 0
Overall Statistics
Accuracy : 0.6109
95% CI : (0.5543, 0.6654)
No Information Rate : 0.4277
P-Value [Acc > NIR] : 6.203e-11
Kappa : 0.3605
Mcnemar's Test P-Value : NA
Statistics by Class:
Class: 4 Class: 5 Class: 6 Class: 7 Class: 8
Sensitivity 0.00000 0.7068 0.6614 0.32432 0.00000
Specificity 1.00000 0.7697 0.6467 0.94526 1.00000
Pos Pred Value NaN 0.6963 0.5638 0.44444 NaN
Neg Pred Value 0.97428 0.7784 0.7346 0.91197 0.98071
Prevalence 0.02572 0.4277 0.4084 0.11897 0.01929
Detection Rate 0.00000 0.3023 0.2701 0.03859 0.00000
Detection Prevalence 0.00000 0.4341 0.4791 0.08682 0.00000
Balanced Accuracy 0.50000 0.7382 0.6541 0.63479 0.50000
# Compute precision, recall, and F1 score from the confusion matrix
metrics <- data.frame(
class = rownames(cm$byClass),
precision = cm$byClass[, "Precision"],
recall = cm$byClass[, "Recall"],
f1 = cm$byClass[, "F1"]
)
# Sort by class in ascending order
metrics <- metrics[order(metrics$class), ]
rownames(metrics) <- NULL
metrics[is.na(metrics)] <- 0
# Print the results
print(metrics, row.names = FALSE)
   class precision    recall        f1
Class: 4 0.0000000 0.0000000 0.0000000
Class: 5 0.6962963 0.7067669 0.7014925
Class: 6 0.5637584 0.6614173 0.6086957
Class: 7 0.4444444 0.3243243 0.3750000
Class: 8 0.0000000 0.0000000 0.0000000
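To compare the three models with a single number despite the imbalanced classes, the per-class F1 scores can be macro-averaged, treating classes that receive no predictions (NA F1) as 0, consistent with the NA handling above. This is a sketch; `macro_f1` is a hypothetical helper operating on a caret `confusionMatrix` object.

```r
# Hypothetical helper: macro-averaged F1 from a caret confusionMatrix;
# classes with undefined (NA) F1 count as 0
macro_f1 <- function(cm) {
  f1 <- cm$byClass[, "F1"]
  f1[is.na(f1)] <- 0
  mean(f1)
}
```

Calling `macro_f1(cm)` after each model's confusion matrix gives one comparable score per model.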
# Decision tree model
y_train <- as.factor(y_train)
dt_model <- rpart(y_train ~ ., data = X_train)
print(summary(dt_model))
Call:
rpart(formula = y_train ~ ., data = X_train)
n= 730
CP nsplit rel error xerror xstd
1 0.25059102 0 1.0000000 1.0780142 0.03092829
2 0.02285264 1 0.7494090 0.7943262 0.03183581
3 0.01773050 4 0.6808511 0.7328605 0.03157212
4 0.01654846 6 0.6453901 0.7186761 0.03148762
5 0.01418440 9 0.5933806 0.7092199 0.03142629
6 0.01300236 10 0.5791962 0.7092199 0.03142629
7 0.01182033 12 0.5531915 0.7092199 0.03142629
8 0.01063830 13 0.5413712 0.6997636 0.03136093
9 0.01000000 15 0.5200946 0.7044917 0.03139411
Variable importance
alcohol sulphates total_sulfur_dioxide
27 18 12
density volatile_acidity citric_acid
9 6 6
Id chlorides fixed_acidity
5 4 4
free_sulfur_dioxide residual_sugar pH
4 3 1
Node number 1: 730 observations, complexity param=0.250591
predicted class=5 expected loss=0.5794521 P(node) =1
class counts: 22 307 298 94 9
probabilities: 0.030 0.421 0.408 0.129 0.012
left son=2 (289 obs) right son=3 (441 obs)
Primary splits:
alcohol < 9.85 to the left, improve=46.77350, (0 missing)
sulphates < 0.625 to the left, improve=36.90807, (0 missing)
total_sulfur_dioxide < 98.5 to the right, improve=18.46788, (0 missing)
volatile_acidity < 0.365 to the right, improve=16.90836, (0 missing)
citric_acid < 0.295 to the left, improve=14.02981, (0 missing)
Surrogate splits:
density < 0.996555 to the right, agree=0.675, adj=0.180, (0 split)
total_sulfur_dioxide < 64.5 to the right, agree=0.666, adj=0.156, (0 split)
Id < 792.5 to the left, agree=0.652, adj=0.121, (0 split)
sulphates < 0.555 to the left, agree=0.638, adj=0.087, (0 split)
chlorides < 0.142 to the right, agree=0.610, adj=0.014, (0 split)
Node number 2: 289 observations, complexity param=0.0177305
predicted class=5 expected loss=0.3217993 P(node) =0.3958904
class counts: 10 196 81 2 0
probabilities: 0.035 0.678 0.280 0.007 0.000
left son=4 (191 obs) right son=5 (98 obs)
Primary splits:
sulphates < 0.625 to the left, improve=12.978860, (0 missing)
volatile_acidity < 0.405 to the right, improve= 3.806351, (0 missing)
fixed_acidity < 10.85 to the left, improve= 3.635381, (0 missing)
total_sulfur_dioxide < 99.5 to the right, improve= 3.591381, (0 missing)
pH < 3.285 to the left, improve= 3.147822, (0 missing)
Surrogate splits:
fixed_acidity < 9.95 to the left, agree=0.702, adj=0.122, (0 split)
volatile_acidity < 0.385 to the right, agree=0.696, adj=0.102, (0 split)
chlorides < 0.113 to the left, agree=0.689, adj=0.082, (0 split)
pH < 3.57 to the left, agree=0.685, adj=0.071, (0 split)
density < 0.99926 to the left, agree=0.682, adj=0.061, (0 split)
Node number 3: 441 observations, complexity param=0.02285264
predicted class=6 expected loss=0.5079365 P(node) =0.6041096
class counts: 12 111 217 92 9
probabilities: 0.027 0.252 0.492 0.209 0.020
left son=6 (212 obs) right son=7 (229 obs)
Primary splits:
sulphates < 0.635 to the left, improve=18.33728, (0 missing)
citric_acid < 0.315 to the left, improve=12.39486, (0 missing)
total_sulfur_dioxide < 93 to the right, improve=12.36300, (0 missing)
alcohol < 11.45 to the left, improve=12.00557, (0 missing)
volatile_acidity < 0.425 to the right, improve=11.46317, (0 missing)
Surrogate splits:
citric_acid < 0.265 to the left, agree=0.696, adj=0.368, (0 split)
volatile_acidity < 0.485 to the right, agree=0.680, adj=0.335, (0 split)
fixed_acidity < 8.35 to the left, agree=0.649, adj=0.269, (0 split)
density < 0.996675 to the left, agree=0.617, adj=0.203, (0 split)
total_sulfur_dioxide < 20.5 to the left, agree=0.592, adj=0.151, (0 split)
Node number 4: 191 observations
predicted class=5 expected loss=0.2146597 P(node) =0.2616438
class counts: 7 150 33 1 0
probabilities: 0.037 0.785 0.173 0.005 0.000
Node number 5: 98 observations, complexity param=0.0177305
predicted class=6 expected loss=0.5102041 P(node) =0.1342466
class counts: 3 46 48 1 0
probabilities: 0.031 0.469 0.490 0.010 0.000
left son=10 (23 obs) right son=11 (75 obs)
Primary splits:
chlorides < 0.097 to the right, improve=5.209831, (0 missing)
Id < 303 to the left, improve=5.045802, (0 missing)
fixed_acidity < 10.85 to the left, improve=2.881190, (0 missing)
volatile_acidity < 0.575 to the right, improve=2.780232, (0 missing)
free_sulfur_dioxide < 4.5 to the left, improve=2.601474, (0 missing)
Surrogate splits:
sulphates < 0.845 to the right, agree=0.806, adj=0.174, (0 split)
volatile_acidity < 0.875 to the right, agree=0.786, adj=0.087, (0 split)
Id < 23 to the left, agree=0.786, adj=0.087, (0 split)
free_sulfur_dioxide < 29.5 to the right, agree=0.776, adj=0.043, (0 split)
Node number 6: 212 observations, complexity param=0.02285264
predicted class=6 expected loss=0.495283 P(node) =0.290411
class counts: 10 83 107 11 1
probabilities: 0.047 0.392 0.505 0.052 0.005
left son=12 (13 obs) right son=13 (199 obs)
Primary splits:
total_sulfur_dioxide < 89.5 to the right, improve=7.481048, (0 missing)
alcohol < 11.45 to the left, improve=7.182390, (0 missing)
sulphates < 0.545 to the left, improve=4.442367, (0 missing)
Id < 1342.5 to the right, improve=4.395546, (0 missing)
volatile_acidity < 0.575 to the right, improve=3.486944, (0 missing)
Node number 7: 229 observations, complexity param=0.02285264
predicted class=6 expected loss=0.5196507 P(node) =0.3136986
class counts: 2 28 110 81 8
probabilities: 0.009 0.122 0.480 0.354 0.035
left son=14 (164 obs) right son=15 (65 obs)
Primary splits:
alcohol < 11.65 to the left, improve=7.915158, (0 missing)
citric_acid < 0.315 to the left, improve=5.132199, (0 missing)
volatile_acidity < 0.425 to the right, improve=5.020187, (0 missing)
fixed_acidity < 5.5 to the right, improve=4.799402, (0 missing)
residual_sugar < 4.5 to the right, improve=4.456437, (0 missing)
Surrogate splits:
density < 0.99469 to the right, agree=0.777, adj=0.215, (0 split)
fixed_acidity < 5.5 to the right, agree=0.747, adj=0.108, (0 split)
chlorides < 0.0485 to the right, agree=0.738, adj=0.077, (0 split)
residual_sugar < 6.15 to the left, agree=0.734, adj=0.062, (0 split)
volatile_acidity < 0.14 to the right, agree=0.725, adj=0.031, (0 split)
Node number 10: 23 observations
predicted class=5 expected loss=0.2173913 P(node) =0.03150685
class counts: 0 18 5 0 0
probabilities: 0.000 0.783 0.217 0.000 0.000
Node number 11: 75 observations, complexity param=0.0141844
predicted class=6 expected loss=0.4266667 P(node) =0.1027397
class counts: 3 28 43 1 0
probabilities: 0.040 0.373 0.573 0.013 0.000
left son=22 (10 obs) right son=23 (65 obs)
Primary splits:
free_sulfur_dioxide < 5.5 to the left, improve=3.729231, (0 missing)
Id < 305 to the left, improve=3.127647, (0 missing)
alcohol < 9.525 to the right, improve=3.123636, (0 missing)
density < 0.99719 to the left, improve=2.804978, (0 missing)
pH < 3.56 to the right, improve=2.411261, (0 missing)
Surrogate splits:
total_sulfur_dioxide < 12 to the left, agree=0.907, adj=0.3, (0 split)
citric_acid < 0.53 to the right, agree=0.880, adj=0.1, (0 split)
Node number 12: 13 observations
predicted class=5 expected loss=0.07692308 P(node) =0.01780822
class counts: 0 12 0 1 0
probabilities: 0.000 0.923 0.000 0.077 0.000
Node number 13: 199 observations, complexity param=0.01654846
predicted class=6 expected loss=0.4623116 P(node) =0.2726027
class counts: 10 71 107 10 1
probabilities: 0.050 0.357 0.538 0.050 0.005
left son=26 (148 obs) right son=27 (51 obs)
Primary splits:
alcohol < 11.45 to the left, improve=5.977509, (0 missing)
free_sulfur_dioxide < 7.5 to the left, improve=3.407341, (0 missing)
Id < 1418.5 to the right, improve=3.200473, (0 missing)
total_sulfur_dioxide < 14.5 to the left, improve=2.998064, (0 missing)
sulphates < 0.595 to the left, improve=2.781150, (0 missing)
Surrogate splits:
density < 0.994185 to the right, agree=0.854, adj=0.431, (0 split)
fixed_acidity < 6.45 to the right, agree=0.779, adj=0.137, (0 split)
volatile_acidity < 0.285 to the right, agree=0.779, adj=0.137, (0 split)
chlorides < 0.0565 to the right, agree=0.759, adj=0.059, (0 split)
free_sulfur_dioxide < 3.5 to the right, agree=0.754, adj=0.039, (0 split)
Node number 14: 164 observations, complexity param=0.01300236
predicted class=6 expected loss=0.4573171 P(node) =0.2246575
class counts: 2 26 89 43 4
probabilities: 0.012 0.159 0.543 0.262 0.024
left son=28 (7 obs) right son=29 (157 obs)
Primary splits:
total_sulfur_dioxide < 87.5 to the right, improve=5.245711, (0 missing)
residual_sugar < 4.45 to the left, improve=4.980329, (0 missing)
citric_acid < 0.635 to the left, improve=3.484209, (0 missing)
volatile_acidity < 0.375 to the right, improve=3.358473, (0 missing)
free_sulfur_dioxide < 25.5 to the right, improve=2.673474, (0 missing)
Node number 15: 65 observations, complexity param=0.01182033
predicted class=7 expected loss=0.4153846 P(node) =0.0890411
class counts: 0 2 21 38 4
probabilities: 0.000 0.031 0.323 0.585 0.062
left son=30 (27 obs) right son=31 (38 obs)
Primary splits:
sulphates < 0.745 to the left, improve=4.645524, (0 missing)
volatile_acidity < 0.52 to the right, improve=3.699961, (0 missing)
fixed_acidity < 5.5 to the right, improve=2.209549, (0 missing)
total_sulfur_dioxide < 74 to the left, improve=1.855944, (0 missing)
citric_acid < 0.325 to the left, improve=1.854779, (0 missing)
Surrogate splits:
total_sulfur_dioxide < 16.5 to the left, agree=0.677, adj=0.222, (0 split)
pH < 3.575 to the right, agree=0.662, adj=0.185, (0 split)
Id < 1102 to the right, agree=0.662, adj=0.185, (0 split)
chlorides < 0.0825 to the right, agree=0.631, adj=0.111, (0 split)
free_sulfur_dioxide < 6.5 to the left, agree=0.631, adj=0.111, (0 split)
Node number 22: 10 observations
predicted class=5 expected loss=0.2 P(node) =0.01369863
class counts: 0 8 2 0 0
probabilities: 0.000 0.800 0.200 0.000 0.000
Node number 23: 65 observations
predicted class=6 expected loss=0.3692308 P(node) =0.0890411
class counts: 3 20 41 1 0
probabilities: 0.046 0.308 0.631 0.015 0.000
Node number 26: 148 observations, complexity param=0.01654846
predicted class=6 expected loss=0.5135135 P(node) =0.2027397
class counts: 8 65 72 3 0
probabilities: 0.054 0.439 0.486 0.020 0.000
left son=52 (44 obs) right son=53 (104 obs)
Primary splits:
free_sulfur_dioxide < 7.5 to the left, improve=4.525090, (0 missing)
Id < 1359 to the right, improve=3.786670, (0 missing)
citric_acid < 0.075 to the right, improve=3.448458, (0 missing)
total_sulfur_dioxide < 27.5 to the left, improve=2.883629, (0 missing)
density < 0.994375 to the left, improve=2.711004, (0 missing)
Surrogate splits:
total_sulfur_dioxide < 16.5 to the left, agree=0.905, adj=0.682, (0 split)
alcohol < 11.25 to the right, agree=0.736, adj=0.114, (0 split)
chlorides < 0.1195 to the right, agree=0.723, adj=0.068, (0 split)
citric_acid < 0.51 to the right, agree=0.716, adj=0.045, (0 split)
pH < 3.605 to the right, agree=0.716, adj=0.045, (0 split)
Node number 27: 51 observations
predicted class=6 expected loss=0.3137255 P(node) =0.06986301
class counts: 2 6 35 7 1
probabilities: 0.039 0.118 0.686 0.137 0.020
Node number 28: 7 observations
predicted class=5 expected loss=0.1428571 P(node) =0.009589041
class counts: 0 6 1 0 0
probabilities: 0.000 0.857 0.143 0.000 0.000
Node number 29: 157 observations, complexity param=0.01300236
predicted class=6 expected loss=0.4394904 P(node) =0.2150685
class counts: 2 20 88 43 4
probabilities: 0.013 0.127 0.561 0.274 0.025
left son=58 (150 obs) right son=59 (7 obs)
Primary splits:
residual_sugar < 4.45 to the left, improve=4.801978, (0 missing)
citric_acid < 0.635 to the left, improve=3.516554, (0 missing)
volatile_acidity < 0.375 to the right, improve=3.407650, (0 missing)
total_sulfur_dioxide < 48 to the right, improve=2.696381, (0 missing)
pH < 3.08 to the right, improve=2.214213, (0 missing)
Surrogate splits:
density < 1.0016 to the left, agree=0.968, adj=0.286, (0 split)
Node number 30: 27 observations
predicted class=6 expected loss=0.4444444 P(node) =0.0369863
class counts: 0 1 15 10 1
probabilities: 0.000 0.037 0.556 0.370 0.037
Node number 31: 38 observations
predicted class=7 expected loss=0.2631579 P(node) =0.05205479
class counts: 0 1 6 28 3
probabilities: 0.000 0.026 0.158 0.737 0.079
Node number 52: 44 observations
predicted class=5 expected loss=0.4090909 P(node) =0.06027397
class counts: 5 26 12 1 0
probabilities: 0.114 0.591 0.273 0.023 0.000
Node number 53: 104 observations, complexity param=0.01654846
predicted class=6 expected loss=0.4230769 P(node) =0.1424658
class counts: 3 39 60 2 0
probabilities: 0.029 0.375 0.577 0.019 0.000
left son=106 (19 obs) right son=107 (85 obs)
Primary splits:
Id < 1416 to the right, improve=4.533067, (0 missing)
citric_acid < 0.075 to the right, improve=3.809731, (0 missing)
fixed_acidity < 10.9 to the right, improve=2.155449, (0 missing)
alcohol < 10.65 to the left, improve=1.802317, (0 missing)
volatile_acidity < 0.9475 to the right, improve=1.771582, (0 missing)
Node number 58: 150 observations, complexity param=0.0106383
predicted class=6 expected loss=0.4133333 P(node) =0.2054795
class counts: 2 19 88 37 4
probabilities: 0.013 0.127 0.587 0.247 0.027
left son=116 (89 obs) right son=117 (61 obs)
Primary splits:
volatile_acidity < 0.375 to the right, improve=4.432951, (0 missing)
citric_acid < 0.315 to the left, improve=2.413058, (0 missing)
total_sulfur_dioxide < 44.5 to the right, improve=2.382561, (0 missing)
density < 0.99785 to the right, improve=2.026125, (0 missing)
alcohol < 10.75 to the left, improve=1.969147, (0 missing)
Surrogate splits:
citric_acid < 0.325 to the left, agree=0.733, adj=0.344, (0 split)
sulphates < 0.765 to the left, agree=0.673, adj=0.197, (0 split)
free_sulfur_dioxide < 5.5 to the right, agree=0.660, adj=0.164, (0 split)
residual_sugar < 1.65 to the right, agree=0.633, adj=0.098, (0 split)
total_sulfur_dioxide < 13.5 to the right, agree=0.633, adj=0.098, (0 split)
Node number 59: 7 observations
predicted class=7 expected loss=0.1428571 P(node) =0.009589041
class counts: 0 1 0 6 0
probabilities: 0.000 0.143 0.000 0.857 0.000
Node number 106: 19 observations
predicted class=5 expected loss=0.3157895 P(node) =0.0260274
class counts: 1 13 5 0 0
probabilities: 0.053 0.684 0.263 0.000 0.000
Node number 107: 85 observations
predicted class=6 expected loss=0.3529412 P(node) =0.1164384
class counts: 2 26 55 2 0
probabilities: 0.024 0.306 0.647 0.024 0.000
Node number 116: 89 observations
predicted class=6 expected loss=0.3146067 P(node) =0.1219178
class counts: 2 12 61 13 1
probabilities: 0.022 0.135 0.685 0.146 0.011
Node number 117: 61 observations, complexity param=0.0106383
predicted class=6 expected loss=0.557377 P(node) =0.08356164
class counts: 0 7 27 24 3
probabilities: 0.000 0.115 0.443 0.393 0.049
left son=234 (26 obs) right son=235 (35 obs)
Primary splits:
citric_acid < 0.475 to the right, improve=3.860133, (0 missing)
alcohol < 10.45 to the left, improve=3.675182, (0 missing)
total_sulfur_dioxide < 49.5 to the right, improve=3.532101, (0 missing)
pH < 3.265 to the right, improve=3.145848, (0 missing)
volatile_acidity < 0.235 to the left, improve=3.083189, (0 missing)
Surrogate splits:
fixed_acidity < 10.55 to the right, agree=0.787, adj=0.500, (0 split)
density < 0.99795 to the right, agree=0.787, adj=0.500, (0 split)
residual_sugar < 2.45 to the right, agree=0.738, adj=0.385, (0 split)
chlorides < 0.0805 to the right, agree=0.738, adj=0.385, (0 split)
alcohol < 10.75 to the left, agree=0.672, adj=0.231, (0 split)
Node number 234: 26 observations
predicted class=6 expected loss=0.3461538 P(node) =0.03561644
class counts: 0 3 17 5 1
probabilities: 0.000 0.115 0.654 0.192 0.038
Node number 235: 35 observations
predicted class=7 expected loss=0.4571429 P(node) =0.04794521
class counts: 0 4 10 19 2
probabilities: 0.000 0.114 0.286 0.543 0.057
n= 730
node), split, n, loss, yval, (yprob)
* denotes terminal node
1) root 730 423 5 (0.03 0.42 0.41 0.13 0.012)
2) alcohol< 9.85 289 93 5 (0.035 0.68 0.28 0.0069 0)
4) sulphates< 0.625 191 41 5 (0.037 0.79 0.17 0.0052 0) *
5) sulphates>=0.625 98 50 6 (0.031 0.47 0.49 0.01 0)
10) chlorides>=0.097 23 5 5 (0 0.78 0.22 0 0) *
11) chlorides< 0.097 75 32 6 (0.04 0.37 0.57 0.013 0)
22) free_sulfur_dioxide< 5.5 10 2 5 (0 0.8 0.2 0 0) *
23) free_sulfur_dioxide>=5.5 65 24 6 (0.046 0.31 0.63 0.015 0) *
3) alcohol>=9.85 441 224 6 (0.027 0.25 0.49 0.21 0.02)
6) sulphates< 0.635 212 105 6 (0.047 0.39 0.5 0.052 0.0047)
12) total_sulfur_dioxide>=89.5 13 1 5 (0 0.92 0 0.077 0) *
13) total_sulfur_dioxide< 89.5 199 92 6 (0.05 0.36 0.54 0.05 0.005)
26) alcohol< 11.45 148 76 6 (0.054 0.44 0.49 0.02 0)
52) free_sulfur_dioxide< 7.5 44 18 5 (0.11 0.59 0.27 0.023 0) *
53) free_sulfur_dioxide>=7.5 104 44 6 (0.029 0.38 0.58 0.019 0)
106) Id>=1416 19 6 5 (0.053 0.68 0.26 0 0) *
107) Id< 1416 85 30 6 (0.024 0.31 0.65 0.024 0) *
27) alcohol>=11.45 51 16 6 (0.039 0.12 0.69 0.14 0.02) *
7) sulphates>=0.635 229 119 6 (0.0087 0.12 0.48 0.35 0.035)
14) alcohol< 11.65 164 75 6 (0.012 0.16 0.54 0.26 0.024)
28) total_sulfur_dioxide>=87.5 7 1 5 (0 0.86 0.14 0 0) *
29) total_sulfur_dioxide< 87.5 157 69 6 (0.013 0.13 0.56 0.27 0.025)
58) residual_sugar< 4.45 150 62 6 (0.013 0.13 0.59 0.25 0.027)
116) volatile_acidity>=0.375 89 28 6 (0.022 0.13 0.69 0.15 0.011) *
117) volatile_acidity< 0.375 61 34 6 (0 0.11 0.44 0.39 0.049)
234) citric_acid>=0.475 26 9 6 (0 0.12 0.65 0.19 0.038) *
235) citric_acid< 0.475 35 16 7 (0 0.11 0.29 0.54 0.057) *
59) residual_sugar>=4.45 7 1 7 (0 0.14 0 0.86 0) *
15) alcohol>=11.65 65 27 7 (0 0.031 0.32 0.58 0.062)
30) sulphates< 0.745 27 12 6 (0 0.037 0.56 0.37 0.037) *
31) sulphates>=0.745 38 10 7 (0 0.026 0.16 0.74 0.079) *
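The CP table at the top of this summary shows the cross-validated error (xerror) flattening out well before the full 15-split tree, so pruning at the CP with the lowest xerror is a natural follow-up. A sketch under that assumption (`best_cp` is a hypothetical helper; the prune call itself uses rpart's standard API):

```r
# Hypothetical helper: complexity parameter with the lowest
# cross-validated error in an rpart CP table
best_cp <- function(cptable) {
  cptable[which.min(cptable[, "xerror"]), "CP"]
}

# Pruning the fitted tree would then look like:
# dt_pruned <- prune(dt_model, cp = best_cp(dt_model$cptable))
```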
# Predictions
predictions_dt <- predict(dt_model, X_test, type = "class")
y_test_f <- as.factor(y_test)
cm <- confusionMatrix(predictions_dt, y_test_f)
print(cm)
Confusion Matrix and Statistics
Reference
Prediction 4 5 6 7 8
4 0 0 0 0 0
5 4 82 36 1 0
6 4 48 78 27 5
7 0 3 13 9 1
8 0 0 0 0 0
Overall Statistics
Accuracy : 0.5434
95% CI : (0.4863, 0.5997)
No Information Rate : 0.4277
P-Value [Acc > NIR] : 2.637e-05
Kappa : 0.2493
Mcnemar's Test P-Value : NA
Statistics by Class:
Class: 4 Class: 5 Class: 6 Class: 7 Class: 8
Sensitivity 0.00000 0.6165 0.6142 0.24324 0.00000
Specificity 1.00000 0.7697 0.5435 0.93796 1.00000
Pos Pred Value NaN 0.6667 0.4815 0.34615 NaN
Neg Pred Value 0.97428 0.7287 0.6711 0.90175 0.98071
Prevalence 0.02572 0.4277 0.4084 0.11897 0.01929
Detection Rate 0.00000 0.2637 0.2508 0.02894 0.00000
Detection Prevalence 0.00000 0.3955 0.5209 0.08360 0.00000
Balanced Accuracy 0.50000 0.6931 0.5788 0.59060 0.50000
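The Kappa value reported above can be reproduced directly from the printed confusion matrix. A minimal sketch (in Python for illustration, with the matrix hard-coded from the output above; rows are predictions, columns are the reference classes 4..8):

```python
# Reproduce Cohen's kappa for the decision-tree confusion matrix above.
cm = [
    [0,  0,  0,  0, 0],
    [4, 82, 36,  1, 0],
    [4, 48, 78, 27, 5],
    [0,  3, 13,  9, 1],
    [0,  0,  0,  0, 0],
]
n = sum(sum(row) for row in cm)                    # total test observations
po = sum(cm[i][i] for i in range(5)) / n           # observed agreement (accuracy)
row_tot = [sum(row) for row in cm]
col_tot = [sum(cm[i][j] for i in range(5)) for j in range(5)]
pe = sum(r * c for r, c in zip(row_tot, col_tot)) / n**2  # chance agreement
kappa = (po - pe) / (1 - pe)
print(round(kappa, 4))  # 0.2493, matching the caret output
```

Kappa corrects raw accuracy for the agreement expected by chance, which matters here because classes 5 and 6 dominate the test set.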
# Compute precision, recall, and F1 score from the confusion matrix
metrics <- data.frame(
class = rownames(cm$byClass),
precision = cm$byClass[, "Precision"],
recall = cm$byClass[, "Recall"],
f1 = cm$byClass[, "F1"]
)
# Sort in ascending order by class
metrics <- metrics[order(metrics$class), ]
rownames(metrics) <- NULL
metrics[is.na(metrics)] <- 0
# Print the results
print(metrics, row.names = FALSE)
class precision recall f1
Class: 4 0.0000000 0.0000000 0.0000000
Class: 5 0.6666667 0.6165414 0.6406250
Class: 6 0.4814815 0.6141732 0.5397924
Class: 7 0.3461538 0.2432432 0.2857143
Class: 8 0.0000000 0.0000000 0.0000000
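These per-class values follow directly from the confusion matrix: precision divides the correct count by the row (prediction) total, recall by the column (reference) total, and F1 is their harmonic mean. A small Python sketch for class 5, with the counts read off the matrix above:

```python
# Class-5 metrics from the decision-tree confusion matrix:
# 82 correct, 123 predictions of class 5, 133 true class-5 references.
tp, row_total, col_total = 82, 123, 133
precision = tp / row_total
recall = tp / col_total
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
print(round(precision, 7), round(recall, 7), round(f1, 6))
# 0.6666667 0.6165414 0.640625 — matching the Class: 5 row above
```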
# Random forest model
y_train <- as.factor(y_train)
rf_model <- randomForest(y_train ~ ., data = X_train)
print(summary(rf_model))
                Length Class  Mode
call               3   -none- call
type               1   -none- character
predicted        730   factor numeric
err.rate        3000   -none- numeric
confusion         30   -none- numeric
votes           3650   matrix numeric
oob.times        730   -none- numeric
classes            5   -none- character
importance        12   -none- numeric
importanceSD       0   -none- NULL
localImportance    0   -none- NULL
proximity          0   -none- NULL
ntree              1   -none- numeric
mtry               1   -none- numeric
forest            14   -none- list
y                730   factor numeric
test               0   -none- NULL
inbag              0   -none- NULL
terms              3   terms  call
# Prediction
predictions_rf <- predict(rf_model, X_test)
predictions_rf_num <- as.numeric(as.character(predictions_rf))
predictions_rf_f <- as.factor(round(predictions_rf_num))
y_test_f <- as.factor(y_test)
cm <- confusionMatrix(predictions_rf_f, y_test_f)
print(cm)
Warning message in levels(reference) != levels(data): “longer object length is not a multiple of shorter object length”
Warning message in confusionMatrix.default(predictions_rf_f, y_test_f): “Levels are not in the same order for reference and data. Refactoring data to match.”
Confusion Matrix and Statistics
Reference
Prediction 4 5 6 7 8
4 0 0 0 0 0
5 6 98 31 2 0
6 2 35 86 20 4
7 0 0 10 15 1
8 0 0 0 0 1
Overall Statistics
Accuracy : 0.6431
95% CI : (0.5871, 0.6964)
No Information Rate : 0.4277
P-Value [Acc > NIR] : 1.706e-14
Kappa : 0.4135
Mcnemar's Test P-Value : NA
Statistics by Class:
Class: 4 Class: 5 Class: 6 Class: 7 Class: 8
Sensitivity 0.00000 0.7368 0.6772 0.40541 0.166667
Specificity 1.00000 0.7809 0.6685 0.95985 1.000000
Pos Pred Value NaN 0.7153 0.5850 0.57692 1.000000
Neg Pred Value 0.97428 0.7989 0.7500 0.92281 0.983871
Prevalence 0.02572 0.4277 0.4084 0.11897 0.019293
Detection Rate 0.00000 0.3151 0.2765 0.04823 0.003215
Detection Prevalence 0.00000 0.4405 0.4727 0.08360 0.003215
Balanced Accuracy 0.50000 0.7589 0.6728 0.68263 0.583333
# Compute precision, recall, and F1 score from the confusion matrix
metrics <- data.frame(
class = rownames(cm$byClass),
precision = cm$byClass[, "Precision"],
recall = cm$byClass[, "Recall"],
f1 = cm$byClass[, "F1"]
)
# Sort in ascending order by class
metrics <- metrics[order(metrics$class), ]
rownames(metrics) <- NULL
metrics[is.na(metrics)] <- 0
# Print the results
print(metrics, row.names = FALSE)
class precision recall f1
Class: 4 0.0000000 0.0000000 0.0000000
Class: 5 0.7153285 0.7368421 0.7259259
Class: 6 0.5850340 0.6771654 0.6277372
Class: 7 0.5769231 0.4054054 0.4761905
Class: 8 1.0000000 0.1666667 0.2857143
# SVM
svm_pred <- predict(model, X_test, type = "class")
svm_acc <- sum(svm_pred == y_test) / length(y_test)
# Decision Tree
dt_pred <- predict(dt_model, X_test, type = "class")
dt_acc <- sum(dt_pred == y_test) / length(y_test)
# Random Forest
rf_pred <- predict(rf_model, X_test, type = "class")
rf_acc <- sum(rf_pred == y_test) / length(y_test)
# Ordinal Logistic Regression
ordinal_model <- polr(as.ordered(y_train) ~ ., data = X_train, Hess=TRUE)
ordinal_pred <- predict(ordinal_model , X_test, type = "class")
ordinal_acc <- sum(ordinal_pred == y_test) / length(y_test)
# Print results
cat("SVM Accuracy: ", svm_acc, "\n")
cat("Decision Tree Accuracy: ", dt_acc, "\n")
cat("Random Forest Accuracy: ", rf_acc, "\n")
cat("Ordinal Logistic Regression Accuracy: ", ordinal_acc, "\n")
SVM Accuracy: 0.6109325
Decision Tree Accuracy: 0.5434084
Random Forest Accuracy: 0.6430868
Ordinal Logistic Regression Accuracy: 0.6012862
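Each of these accuracies is simply the confusion-matrix diagonal divided by the test-set size (311 observations here). A quick Python check against the decision-tree matrix printed earlier:

```python
# Accuracy = correctly classified / total, using the decision-tree
# diagonal counts for classes 4..8 from the matrix above.
diag = [0, 82, 78, 9, 0]
n_test = 311
accuracy = sum(diag) / n_test
print(round(accuracy, 7))  # 0.5434084, matching "Decision Tree Accuracy"
```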
# Load required package
library(pROC)
# Compute predicted probabilities for each model
svm_prob <- predict(model, X_test, probability = TRUE)
dt_prob <- predict(dt_model, X_test, type = "prob")
rf_prob <- predict(rf_model, X_test, type = "prob")
svm_probabilities <- attr(svm_prob, "probabilities")
ordinal_prob <- predict(ordinal_model, X_test, type="probs")
# Compute AUROC for each model and class
svm_rocs <- lapply(levels(as.factor(y_test)), function(class) {
if(sum(y_test == class) > 0){
class_index <- which(colnames(svm_probabilities) == class)
class_prob <- svm_probabilities[, class_index]
response = as.integer(y_test == class)
roc_obj = multiclass.roc(response = response, predictor = class_prob)
return(roc_obj)
}
else{
return(NULL)
}
})
dt_rocs <- lapply(levels(as.factor(y_test)), function(class) {
if(sum(y_test == class) > 0){
class_prob = dt_prob[,class]
response = as.integer(y_test == class)
roc_obj = multiclass.roc(response = response, predictor = class_prob)
return(roc_obj)
}
else{
return(NULL)
}
})
rf_rocs <- lapply(levels(as.factor(y_test)), function(class) {
if(sum(y_test == class) > 0){
class_prob = rf_prob[,class]
response = as.integer(y_test == class)
roc_obj = multiclass.roc(response = response, predictor = class_prob)
return(roc_obj)
}
else{
return(NULL)
}
})
ordinal_rocs <- lapply(levels(as.factor(y_test)), function(class) {
if(sum(y_test == class) > 0){
class_prob = ordinal_prob[,class]
response = as.integer(y_test == class)
roc_obj = multiclass.roc(response = response, predictor = class_prob)
return(roc_obj)
}
else{
return(NULL)
}
})
# Print AUROC for each model
cat("[ SVM AUROC ]\n")
for (i in seq_along(svm_rocs)) {
cat("SVM AUROC for class ", levels(as.factor(y_test))[i], ": ",
auc(svm_rocs[[i]]), "\n")
}
cat("\n[ Decision Tree AUROC ]\n")
for (i in seq_along(dt_rocs)) {
cat("Decision Tree AUROC for class ", levels(as.factor(y_test))[i], ": ",
auc(dt_rocs[[i]]), "\n")
}
cat("\n[ Random Forest AUROC ]\n")
for (i in seq_along(rf_rocs)) {
cat("Random Forest AUROC for class ", levels(as.factor(y_test))[i], ": ",
auc(rf_rocs[[i]]), "\n")
}
cat("\n[ Ordinal Logistic Regression AUROC ]\n")
for (i in seq_along(ordinal_rocs)) {
cat("Ordinal Logistic Regression AUROC for class ", levels(as.factor(y_test))[i], ": ",
auc(ordinal_rocs[[i]]), "\n")
}
Type 'citation("pROC")' for a citation.
Attaching package: ‘pROC’
The following objects are masked from ‘package:stats’:
cov, smooth, var
Setting direction: controls < cases
[ SVM AUROC ]
SVM AUROC for class 4 : 0.7801155
SVM AUROC for class 5 : 0.7949227
SVM AUROC for class 6 : 0.6955238
SVM AUROC for class 7 : 0.8758138
SVM AUROC for class 8 : 0.8754098
[ Decision Tree AUROC ]
Decision Tree AUROC for class 4 : 0.615099
Decision Tree AUROC for class 5 : 0.7746684
Decision Tree AUROC for class 6 : 0.6252782
Decision Tree AUROC for class 7 : 0.796015
Decision Tree AUROC for class 8 : 0.8295082
[ Random Forest AUROC ]
Random Forest AUROC for class 4 : 0.7689769
Random Forest AUROC for class 5 : 0.8386838
Random Forest AUROC for class 6 : 0.7344231
Random Forest AUROC for class 7 : 0.8816335
Random Forest AUROC for class 8 : 0.8527322
[ Ordinal Logistic Regression AUROC ]
Ordinal Logistic Regression AUROC for class 4 : 0.6716172
Ordinal Logistic Regression AUROC for class 5 : 0.8186618
Ordinal Logistic Regression AUROC for class 6 : 0.6928278
Ordinal Logistic Regression AUROC for class 7 : 0.8663444
Ordinal Logistic Regression AUROC for class 8 : 0.904918
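Each per-class AUROC here is a one-vs-rest AUC: the class in question is treated as positive and all other classes as negative, and the AUC equals the probability that a randomly chosen positive receives a higher predicted probability than a randomly chosen negative (the Mann-Whitney rank statistic). A self-contained Python sketch on toy scores, not the wine data:

```python
def auc_ovr(scores, labels):
    """One-vs-rest AUC: fraction of (positive, negative) pairs in which the
    positive outranks the negative, counting ties as half a win."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

# Perfect separation gives AUC 1.0; one inverted pair out of four lowers it.
print(auc_ovr([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0]))  # 1.0
print(auc_ovr([0.9, 0.2, 0.3, 0.1], [1, 1, 0, 0]))  # 0.75
```

This is why the rare classes 4 and 8 can have a respectable AUROC even though their classification F1 is zero: ranking quality is measured independently of the decision threshold.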
X <- wine_new[, c("volatile_acidity", "citric_acid", "total_sulfur_dioxide", "sulphates", "alcohol")]
y <- wine_new$quality
# Create training and test datasets
set.seed(123) # Setting a seed for reproducibility
splitIndex <- createDataPartition(y, p = 0.70, list = FALSE)
X_train <- X[splitIndex,]
y_train <- y[splitIndex]
X_test <- X[-splitIndex,]
y_test <- y[-splitIndex]
print(dim(X_train))
print(dim(X_test))
print(length(y_train))
print(length(y_test))
[1] 730   5
[1] 311   5
[1] 730
[1] 311
# SVM model
y_train <- as.factor(y_train)
model <- svm(y_train ~ ., data = X_train, probability = TRUE)
print(summary(model))
Call:
svm(formula = y_train ~ ., data = X_train, probability = TRUE)
Parameters:
SVM-Type: C-classification
SVM-Kernel: radial
cost: 1
Number of Support Vectors: 589
( 218 88 252 22 9 )
Number of Classes: 5
Levels:
4 5 6 7 8
# Predict with the SVM model (this call was missing from the original cell)
predictions <- predict(model, X_test)
# Convert the factor predictions to numeric
predictions_numeric <- as.numeric(levels(predictions))[predictions]
# Round to the nearest quality class
predictions_f <- as.factor(round(predictions_numeric))
# Confusion matrix (refresh the reference factor for this split)
y_test_f <- as.factor(y_test)
cm <- confusionMatrix(predictions_f, y_test_f)
print(cm)
Warning message in levels(reference) != levels(data): “longer object length is not a multiple of shorter object length”
Warning message in confusionMatrix.default(predictions_f, y_test_f): “Levels are not in the same order for reference and data. Refactoring data to match.”
Confusion Matrix and Statistics
Reference
Prediction 4 5 6 7 8
4 0 0 0 0 0
5 6 94 33 2 0
6 2 36 84 23 4
7 0 3 10 12 2
8 0 0 0 0 0
Overall Statistics
Accuracy : 0.6109
95% CI : (0.5543, 0.6654)
No Information Rate : 0.4277
P-Value [Acc > NIR] : 6.203e-11
Kappa : 0.3605
Mcnemar's Test P-Value : NA
Statistics by Class:
Class: 4 Class: 5 Class: 6 Class: 7 Class: 8
Sensitivity 0.00000 0.7068 0.6614 0.32432 0.00000
Specificity 1.00000 0.7697 0.6467 0.94526 1.00000
Pos Pred Value NaN 0.6963 0.5638 0.44444 NaN
Neg Pred Value 0.97428 0.7784 0.7346 0.91197 0.98071
Prevalence 0.02572 0.4277 0.4084 0.11897 0.01929
Detection Rate 0.00000 0.3023 0.2701 0.03859 0.00000
Detection Prevalence 0.00000 0.4341 0.4791 0.08682 0.00000
Balanced Accuracy 0.50000 0.7382 0.6541 0.63479 0.50000
# Compute precision, recall, and F1 score from the confusion matrix
metrics <- data.frame(
class = rownames(cm$byClass),
precision = cm$byClass[, "Precision"],
recall = cm$byClass[, "Recall"],
f1 = cm$byClass[, "F1"]
)
# Sort in ascending order by class
metrics <- metrics[order(metrics$class), ]
rownames(metrics) <- NULL
metrics[is.na(metrics)] <- 0
# Print the results
print(metrics, row.names = FALSE)
class precision recall f1
Class: 4 0.0000000 0.0000000 0.0000000
Class: 5 0.6962963 0.7067669 0.7014925
Class: 6 0.5637584 0.6614173 0.6086957
Class: 7 0.4444444 0.3243243 0.3750000
Class: 8 0.0000000 0.0000000 0.0000000
# Decision tree model
y_train <- as.factor(y_train)
dt_model <- rpart(y_train ~ ., data = X_train)
print(summary(dt_model))
Call:
rpart(formula = y_train ~ ., data = X_train)
n= 730
CP nsplit rel error xerror xstd
1 0.25059102 0 1.0000000 1.0780142 0.03092829
2 0.02285264 1 0.7494090 0.7943262 0.03183581
3 0.01418440 4 0.6808511 0.7139480 0.03145745
4 0.01182033 6 0.6524823 0.7257683 0.03153099
5 0.01000000 15 0.5437352 0.7139480 0.03145745
Variable importance
alcohol sulphates total_sulfur_dioxide
39 23 15
citric_acid volatile_acidity
13 11
Node number 1: 730 observations, complexity param=0.250591
predicted class=5 expected loss=0.5794521 P(node) =1
class counts: 22 307 298 94 9
probabilities: 0.030 0.421 0.408 0.129 0.012
left son=2 (289 obs) right son=3 (441 obs)
Primary splits:
alcohol < 9.85 to the left, improve=46.77350, (0 missing)
sulphates < 0.625 to the left, improve=36.90807, (0 missing)
total_sulfur_dioxide < 98.5 to the right, improve=18.46788, (0 missing)
volatile_acidity < 0.365 to the right, improve=16.90836, (0 missing)
citric_acid < 0.295 to the left, improve=14.02981, (0 missing)
Surrogate splits:
total_sulfur_dioxide < 64.5 to the right, agree=0.666, adj=0.156, (0 split)
sulphates < 0.555 to the left, agree=0.638, adj=0.087, (0 split)
Node number 2: 289 observations, complexity param=0.0141844
predicted class=5 expected loss=0.3217993 P(node) =0.3958904
class counts: 10 196 81 2 0
probabilities: 0.035 0.678 0.280 0.007 0.000
left son=4 (191 obs) right son=5 (98 obs)
Primary splits:
sulphates < 0.625 to the left, improve=12.978860, (0 missing)
volatile_acidity < 0.405 to the right, improve= 3.806351, (0 missing)
total_sulfur_dioxide < 99.5 to the right, improve= 3.591381, (0 missing)
alcohol < 9.05 to the right, improve= 1.491127, (0 missing)
citric_acid < 0.015 to the left, improve= 1.064644, (0 missing)
Surrogate splits:
volatile_acidity < 0.385 to the right, agree=0.696, adj=0.102, (0 split)
citric_acid < 0.395 to the left, agree=0.675, adj=0.041, (0 split)
Node number 3: 441 observations, complexity param=0.02285264
predicted class=6 expected loss=0.5079365 P(node) =0.6041096
class counts: 12 111 217 92 9
probabilities: 0.027 0.252 0.492 0.209 0.020
left son=6 (212 obs) right son=7 (229 obs)
Primary splits:
sulphates < 0.635 to the left, improve=18.33728, (0 missing)
citric_acid < 0.315 to the left, improve=12.39486, (0 missing)
total_sulfur_dioxide < 93 to the right, improve=12.36300, (0 missing)
alcohol < 11.45 to the left, improve=12.00557, (0 missing)
volatile_acidity < 0.425 to the right, improve=11.46317, (0 missing)
Surrogate splits:
citric_acid < 0.265 to the left, agree=0.696, adj=0.368, (0 split)
volatile_acidity < 0.485 to the right, agree=0.680, adj=0.335, (0 split)
total_sulfur_dioxide < 20.5 to the left, agree=0.592, adj=0.151, (0 split)
alcohol < 10.55 to the left, agree=0.592, adj=0.151, (0 split)
Node number 4: 191 observations
predicted class=5 expected loss=0.2146597 P(node) =0.2616438
class counts: 7 150 33 1 0
probabilities: 0.037 0.785 0.173 0.005 0.000
Node number 5: 98 observations, complexity param=0.0141844
predicted class=6 expected loss=0.5102041 P(node) =0.1342466
class counts: 3 46 48 1 0
probabilities: 0.031 0.469 0.490 0.010 0.000
left son=10 (30 obs) right son=11 (68 obs)
Primary splits:
volatile_acidity < 0.575 to the right, improve=2.7802320, (0 missing)
total_sulfur_dioxide < 28.5 to the right, improve=2.2665310, (0 missing)
alcohol < 9.525 to the right, improve=1.3554420, (0 missing)
sulphates < 0.915 to the right, improve=0.4882261, (0 missing)
citric_acid < 0.085 to the right, improve=0.3334526, (0 missing)
Surrogate splits:
citric_acid < 0.03 to the left, agree=0.755, adj=0.200, (0 split)
total_sulfur_dioxide < 103 to the right, agree=0.745, adj=0.167, (0 split)
alcohol < 9.75 to the right, agree=0.724, adj=0.100, (0 split)
Node number 6: 212 observations, complexity param=0.02285264
predicted class=6 expected loss=0.495283 P(node) =0.290411
class counts: 10 83 107 11 1
probabilities: 0.047 0.392 0.505 0.052 0.005
left son=12 (13 obs) right son=13 (199 obs)
Primary splits:
total_sulfur_dioxide < 89.5 to the right, improve=7.481048, (0 missing)
alcohol < 11.45 to the left, improve=7.182390, (0 missing)
sulphates < 0.545 to the left, improve=4.442367, (0 missing)
volatile_acidity < 0.575 to the right, improve=3.486944, (0 missing)
citric_acid < 0.095 to the right, improve=3.134632, (0 missing)
Node number 7: 229 observations, complexity param=0.02285264
predicted class=6 expected loss=0.5196507 P(node) =0.3136986
class counts: 2 28 110 81 8
probabilities: 0.009 0.122 0.480 0.354 0.035
left son=14 (164 obs) right son=15 (65 obs)
Primary splits:
alcohol < 11.65 to the left, improve=7.915158, (0 missing)
citric_acid < 0.315 to the left, improve=5.132199, (0 missing)
volatile_acidity < 0.425 to the right, improve=5.020187, (0 missing)
total_sulfur_dioxide < 85.5 to the right, improve=4.089458, (0 missing)
sulphates < 0.725 to the left, improve=4.017974, (0 missing)
Surrogate splits:
volatile_acidity < 0.14 to the right, agree=0.725, adj=0.031, (0 split)
sulphates < 1.12 to the left, agree=0.725, adj=0.031, (0 split)
Node number 10: 30 observations
predicted class=5 expected loss=0.3666667 P(node) =0.04109589
class counts: 2 19 9 0 0
probabilities: 0.067 0.633 0.300 0.000 0.000
Node number 11: 68 observations
predicted class=6 expected loss=0.4264706 P(node) =0.09315068
class counts: 1 27 39 1 0
probabilities: 0.015 0.397 0.574 0.015 0.000
Node number 12: 13 observations
predicted class=5 expected loss=0.07692308 P(node) =0.01780822
class counts: 0 12 0 1 0
probabilities: 0.000 0.923 0.000 0.077 0.000
Node number 13: 199 observations, complexity param=0.01182033
predicted class=6 expected loss=0.4623116 P(node) =0.2726027
class counts: 10 71 107 10 1
probabilities: 0.050 0.357 0.538 0.050 0.005
left son=26 (148 obs) right son=27 (51 obs)
Primary splits:
alcohol < 11.45 to the left, improve=5.977509, (0 missing)
total_sulfur_dioxide < 14.5 to the left, improve=2.998064, (0 missing)
sulphates < 0.595 to the left, improve=2.781150, (0 missing)
citric_acid < 0.095 to the right, improve=2.605720, (0 missing)
volatile_acidity < 0.285 to the right, improve=2.204497, (0 missing)
Surrogate splits:
volatile_acidity < 0.285 to the right, agree=0.779, adj=0.137, (0 split)
Node number 14: 164 observations, complexity param=0.01182033
predicted class=6 expected loss=0.4573171 P(node) =0.2246575
class counts: 2 26 89 43 4
probabilities: 0.012 0.159 0.543 0.262 0.024
left son=28 (7 obs) right son=29 (157 obs)
Primary splits:
total_sulfur_dioxide < 87.5 to the right, improve=5.245711, (0 missing)
citric_acid < 0.635 to the left, improve=3.484209, (0 missing)
volatile_acidity < 0.375 to the right, improve=3.358473, (0 missing)
sulphates < 0.735 to the left, improve=1.822213, (0 missing)
alcohol < 10.45 to the left, improve=1.576774, (0 missing)
Node number 15: 65 observations, complexity param=0.01182033
predicted class=7 expected loss=0.4153846 P(node) =0.0890411
class counts: 0 2 21 38 4
probabilities: 0.000 0.031 0.323 0.585 0.062
left son=30 (27 obs) right son=31 (38 obs)
Primary splits:
sulphates < 0.745 to the left, improve=4.645524, (0 missing)
volatile_acidity < 0.52 to the right, improve=3.699961, (0 missing)
total_sulfur_dioxide < 74 to the left, improve=1.855944, (0 missing)
citric_acid < 0.325 to the left, improve=1.854779, (0 missing)
alcohol < 12.85 to the left, improve=1.148283, (0 missing)
Surrogate splits:
total_sulfur_dioxide < 16.5 to the left, agree=0.677, adj=0.222, (0 split)
citric_acid < 0.385 to the left, agree=0.615, adj=0.074, (0 split)
alcohol < 11.85 to the left, agree=0.615, adj=0.074, (0 split)
Node number 26: 148 observations, complexity param=0.01182033
predicted class=6 expected loss=0.5135135 P(node) =0.2027397
class counts: 8 65 72 3 0
probabilities: 0.054 0.439 0.486 0.020 0.000
left son=52 (96 obs) right son=53 (52 obs)
Primary splits:
citric_acid < 0.075 to the right, improve=3.4484580, (0 missing)
total_sulfur_dioxide < 27.5 to the left, improve=2.8836290, (0 missing)
sulphates < 0.595 to the left, improve=2.4299510, (0 missing)
volatile_acidity < 0.9475 to the right, improve=1.7034560, (0 missing)
alcohol < 10.95 to the right, improve=0.8755424, (0 missing)
Surrogate splits:
volatile_acidity < 0.7825 to the left, agree=0.696, adj=0.135, (0 split)
sulphates < 0.625 to the left, agree=0.655, adj=0.019, (0 split)
Node number 27: 51 observations
predicted class=6 expected loss=0.3137255 P(node) =0.06986301
class counts: 2 6 35 7 1
probabilities: 0.039 0.118 0.686 0.137 0.020
Node number 28: 7 observations
predicted class=5 expected loss=0.1428571 P(node) =0.009589041
class counts: 0 6 1 0 0
probabilities: 0.000 0.857 0.143 0.000 0.000
Node number 29: 157 observations, complexity param=0.01182033
predicted class=6 expected loss=0.4394904 P(node) =0.2150685
class counts: 2 20 88 43 4
probabilities: 0.013 0.127 0.561 0.274 0.025
left son=58 (149 obs) right son=59 (8 obs)
Primary splits:
citric_acid < 0.635 to the left, improve=3.516554, (0 missing)
volatile_acidity < 0.375 to the right, improve=3.407650, (0 missing)
total_sulfur_dioxide < 48 to the right, improve=2.696381, (0 missing)
sulphates < 0.735 to the left, improve=1.534193, (0 missing)
alcohol < 10.45 to the left, improve=1.529523, (0 missing)
Node number 30: 27 observations
predicted class=6 expected loss=0.4444444 P(node) =0.0369863
class counts: 0 1 15 10 1
probabilities: 0.000 0.037 0.556 0.370 0.037
Node number 31: 38 observations
predicted class=7 expected loss=0.2631579 P(node) =0.05205479
class counts: 0 1 6 28 3
probabilities: 0.000 0.026 0.158 0.737 0.079
Node number 52: 96 observations, complexity param=0.01182033
predicted class=5 expected loss=0.46875 P(node) =0.1315068
class counts: 3 51 41 1 0
probabilities: 0.031 0.531 0.427 0.010 0.000
left son=104 (18 obs) right son=105 (78 obs)
Primary splits:
volatile_acidity < 0.6925 to the right, improve=2.8643160, (0 missing)
total_sulfur_dioxide < 56.5 to the left, improve=2.6915300, (0 missing)
citric_acid < 0.235 to the left, improve=1.7962040, (0 missing)
sulphates < 0.595 to the left, improve=1.6704620, (0 missing)
alcohol < 10.95 to the right, improve=0.9182023, (0 missing)
Surrogate splits:
total_sulfur_dioxide < 77 to the right, agree=0.844, adj=0.167, (0 split)
citric_acid < 0.135 to the left, agree=0.833, adj=0.111, (0 split)
Node number 53: 52 observations
predicted class=6 expected loss=0.4038462 P(node) =0.07123288
class counts: 5 14 31 2 0
probabilities: 0.096 0.269 0.596 0.038 0.000
Node number 58: 149 observations, complexity param=0.01182033
predicted class=6 expected loss=0.4161074 P(node) =0.2041096
class counts: 2 19 87 37 4
probabilities: 0.013 0.128 0.584 0.248 0.027
left son=116 (91 obs) right son=117 (58 obs)
Primary splits:
volatile_acidity < 0.375 to the right, improve=3.456376, (0 missing)
citric_acid < 0.475 to the right, improve=2.224812, (0 missing)
total_sulfur_dioxide < 48 to the right, improve=1.805652, (0 missing)
alcohol < 10.45 to the left, improve=1.622840, (0 missing)
sulphates < 0.735 to the left, improve=1.440520, (0 missing)
Surrogate splits:
citric_acid < 0.325 to the left, agree=0.745, adj=0.345, (0 split)
sulphates < 0.775 to the left, agree=0.685, adj=0.190, (0 split)
total_sulfur_dioxide < 13.5 to the right, agree=0.658, adj=0.121, (0 split)
Node number 59: 8 observations
predicted class=7 expected loss=0.25 P(node) =0.0109589
class counts: 0 1 1 6 0
probabilities: 0.000 0.125 0.125 0.750 0.000
Node number 104: 18 observations
predicted class=5 expected loss=0.2222222 P(node) =0.02465753
class counts: 1 14 3 0 0
probabilities: 0.056 0.778 0.167 0.000 0.000
Node number 105: 78 observations, complexity param=0.01182033
predicted class=6 expected loss=0.5128205 P(node) =0.1068493
class counts: 2 37 38 1 0
probabilities: 0.026 0.474 0.487 0.013 0.000
left son=210 (52 obs) right son=211 (26 obs)
Primary splits:
alcohol < 10.25 to the right, improve=3.2948720, (0 missing)
total_sulfur_dioxide < 56.5 to the left, improve=3.2305250, (0 missing)
citric_acid < 0.375 to the right, improve=1.2179490, (0 missing)
sulphates < 0.595 to the left, improve=0.9546520, (0 missing)
volatile_acidity < 0.6425 to the left, improve=0.6766238, (0 missing)
Surrogate splits:
volatile_acidity < 0.505 to the left, agree=0.756, adj=0.269, (0 split)
total_sulfur_dioxide < 49.5 to the left, agree=0.756, adj=0.269, (0 split)
sulphates < 0.45 to the right, agree=0.705, adj=0.115, (0 split)
Node number 116: 91 observations
predicted class=6 expected loss=0.3296703 P(node) =0.1246575
class counts: 2 12 61 15 1
probabilities: 0.022 0.132 0.670 0.165 0.011
Node number 117: 58 observations, complexity param=0.01182033
predicted class=6 expected loss=0.5517241 P(node) =0.07945205
class counts: 0 7 26 22 3
probabilities: 0.000 0.121 0.448 0.379 0.052
left son=234 (22 obs) right son=235 (36 obs)
Primary splits:
citric_acid < 0.475 to the right, improve=5.717172, (0 missing)
alcohol < 10.45 to the left, improve=3.517857, (0 missing)
total_sulfur_dioxide < 51 to the right, improve=2.925000, (0 missing)
volatile_acidity < 0.235 to the left, improve=2.551020, (0 missing)
sulphates < 0.685 to the left, improve=2.441667, (0 missing)
Surrogate splits:
alcohol < 10.75 to the left, agree=0.724, adj=0.273, (0 split)
volatile_acidity < 0.305 to the left, agree=0.638, adj=0.045, (0 split)
total_sulfur_dioxide < 8.5 to the left, agree=0.638, adj=0.045, (0 split)
Node number 210: 52 observations
predicted class=5 expected loss=0.4230769 P(node) =0.07123288
class counts: 1 30 20 1 0
probabilities: 0.019 0.577 0.385 0.019 0.000
Node number 211: 26 observations
predicted class=6 expected loss=0.3076923 P(node) =0.03561644
class counts: 1 7 18 0 0
probabilities: 0.038 0.269 0.692 0.000 0.000
Node number 234: 22 observations
predicted class=6 expected loss=0.2727273 P(node) =0.03013699
class counts: 0 3 16 2 1
probabilities: 0.000 0.136 0.727 0.091 0.045
Node number 235: 36 observations
predicted class=7 expected loss=0.4444444 P(node) =0.04931507
class counts: 0 4 10 20 2
probabilities: 0.000 0.111 0.278 0.556 0.056
n= 730
node), split, n, loss, yval, (yprob)
* denotes terminal node
1) root 730 423 5 (0.03 0.42 0.41 0.13 0.012)
2) alcohol< 9.85 289 93 5 (0.035 0.68 0.28 0.0069 0)
4) sulphates< 0.625 191 41 5 (0.037 0.79 0.17 0.0052 0) *
5) sulphates>=0.625 98 50 6 (0.031 0.47 0.49 0.01 0)
10) volatile_acidity>=0.575 30 11 5 (0.067 0.63 0.3 0 0) *
11) volatile_acidity< 0.575 68 29 6 (0.015 0.4 0.57 0.015 0) *
3) alcohol>=9.85 441 224 6 (0.027 0.25 0.49 0.21 0.02)
6) sulphates< 0.635 212 105 6 (0.047 0.39 0.5 0.052 0.0047)
12) total_sulfur_dioxide>=89.5 13 1 5 (0 0.92 0 0.077 0) *
13) total_sulfur_dioxide< 89.5 199 92 6 (0.05 0.36 0.54 0.05 0.005)
26) alcohol< 11.45 148 76 6 (0.054 0.44 0.49 0.02 0)
52) citric_acid>=0.075 96 45 5 (0.031 0.53 0.43 0.01 0)
104) volatile_acidity>=0.6925 18 4 5 (0.056 0.78 0.17 0 0) *
105) volatile_acidity< 0.6925 78 40 6 (0.026 0.47 0.49 0.013 0)
210) alcohol>=10.25 52 22 5 (0.019 0.58 0.38 0.019 0) *
211) alcohol< 10.25 26 8 6 (0.038 0.27 0.69 0 0) *
53) citric_acid< 0.075 52 21 6 (0.096 0.27 0.6 0.038 0) *
27) alcohol>=11.45 51 16 6 (0.039 0.12 0.69 0.14 0.02) *
7) sulphates>=0.635 229 119 6 (0.0087 0.12 0.48 0.35 0.035)
14) alcohol< 11.65 164 75 6 (0.012 0.16 0.54 0.26 0.024)
28) total_sulfur_dioxide>=87.5 7 1 5 (0 0.86 0.14 0 0) *
29) total_sulfur_dioxide< 87.5 157 69 6 (0.013 0.13 0.56 0.27 0.025)
58) citric_acid< 0.635 149 62 6 (0.013 0.13 0.58 0.25 0.027)
116) volatile_acidity>=0.375 91 30 6 (0.022 0.13 0.67 0.16 0.011) *
117) volatile_acidity< 0.375 58 32 6 (0 0.12 0.45 0.38 0.052)
234) citric_acid>=0.475 22 6 6 (0 0.14 0.73 0.091 0.045) *
235) citric_acid< 0.475 36 16 7 (0 0.11 0.28 0.56 0.056) *
59) citric_acid>=0.635 8 2 7 (0 0.12 0.12 0.75 0) *
15) alcohol>=11.65 65 27 7 (0 0.031 0.32 0.58 0.062)
30) sulphates< 0.745 27 12 6 (0 0.037 0.56 0.37 0.037) *
31) sulphates>=0.745 38 10 7 (0 0.026 0.16 0.74 0.079) *
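Every node line in the rpart summary can be reconstructed from its class counts: the predicted class is the majority class, the expected loss is the fraction of observations outside it, and the printed probabilities are the counts divided by the node size. A sketch for the root node, with the counts (22, 307, 298, 94, 9 over quality 4..8) hard-coded from the summary above:

```python
# Root-node statistics from the rpart summary above.
counts = [22, 307, 298, 94, 9]   # class counts for quality 4..8
classes = [4, 5, 6, 7, 8]
n = sum(counts)                  # 730 training observations
pred = classes[counts.index(max(counts))]      # majority class
expected_loss = 1 - max(counts) / n            # misclassified fraction
probs = [round(c / n, 3) for c in counts]      # printed class probabilities
print(pred, round(expected_loss, 7), probs)
# 5 0.5794521 [0.03, 0.421, 0.408, 0.129, 0.012]
```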
# Prediction
predictions_dt <- predict(dt_model, X_test, type = "class")
y_test_f <- as.factor(y_test)
cm <- confusionMatrix(predictions_dt, y_test_f)
print(cm)
Confusion Matrix and Statistics
Reference
Prediction 4 5 6 7 8
4 0 0 0 0 0
5 4 91 44 4 0
6 4 42 71 23 5
7 0 0 12 10 1
8 0 0 0 0 0
Overall Statistics
Accuracy : 0.5531
95% CI : (0.4959, 0.6092)
No Information Rate : 0.4277
P-Value [Acc > NIR] : 5.829e-06
Kappa : 0.2602
Mcnemar's Test P-Value : NA
Statistics by Class:
Class: 4 Class: 5 Class: 6 Class: 7 Class: 8
Sensitivity 0.00000 0.6842 0.5591 0.27027 0.00000
Specificity 1.00000 0.7079 0.5978 0.95255 1.00000
Pos Pred Value NaN 0.6364 0.4897 0.43478 NaN
Neg Pred Value 0.97428 0.7500 0.6627 0.90625 0.98071
Prevalence 0.02572 0.4277 0.4084 0.11897 0.01929
Detection Rate 0.00000 0.2926 0.2283 0.03215 0.00000
Detection Prevalence 0.00000 0.4598 0.4662 0.07395 0.00000
Balanced Accuracy 0.50000 0.6960 0.5784 0.61141 0.50000
# Compute precision, recall, and F1 score from the confusion matrix
metrics <- data.frame(
class = rownames(cm$byClass),
precision = cm$byClass[, "Precision"],
recall = cm$byClass[, "Recall"],
f1 = cm$byClass[, "F1"]
)
# Sort in ascending order by class
metrics <- metrics[order(metrics$class), ]
rownames(metrics) <- NULL
metrics[is.na(metrics)] <- 0
# Print the results
print(metrics, row.names = FALSE)
class precision recall f1
Class: 4 0.0000000 0.0000000 0.0000000
Class: 5 0.6363636 0.6842105 0.6594203
Class: 6 0.4896552 0.5590551 0.5220588
Class: 7 0.4347826 0.2702703 0.3333333
Class: 8 0.0000000 0.0000000 0.0000000
# Random forest model
y_train <- as.factor(y_train)
rf_model <- randomForest(y_train ~ ., data = X_train)
print(summary(rf_model))
                Length Class  Mode
call               3   -none- call
type               1   -none- character
predicted        730   factor numeric
err.rate        3000   -none- numeric
confusion         30   -none- numeric
votes           3650   matrix numeric
oob.times        730   -none- numeric
classes            5   -none- character
importance         5   -none- numeric
importanceSD       0   -none- NULL
localImportance    0   -none- NULL
proximity          0   -none- NULL
ntree              1   -none- numeric
mtry               1   -none- numeric
forest            14   -none- list
y                730   factor numeric
test               0   -none- NULL
inbag              0   -none- NULL
terms              3   terms  call
# Prediction
predictions_rf <- predict(rf_model, X_test)
predictions_rf_num <- as.numeric(as.character(predictions_rf))
predictions_rf_f <- as.factor(round(predictions_rf_num))
y_test_f <- as.factor(y_test)
cm <- confusionMatrix(predictions_rf_f, y_test_f)
print(cm)
Warning message in levels(reference) != levels(data): “longer object length is not a multiple of shorter object length”
Warning message in confusionMatrix.default(predictions_rf_f, y_test_f): “Levels are not in the same order for reference and data. Refactoring data to match.”
Confusion Matrix and Statistics
Reference
Prediction 4 5 6 7 8
4 0 0 0 0 0
5 3 90 32 3 0
6 5 43 82 18 3
7 0 0 13 16 2
8 0 0 0 0 1
Overall Statistics
Accuracy : 0.6077
95% CI : (0.551, 0.6623)
No Information Rate : 0.4277
P-Value [Acc > NIR] : 1.308e-10
Kappa : 0.3609
Mcnemar's Test P-Value : NA
Statistics by Class:
Class: 4 Class: 5 Class: 6 Class: 7 Class: 8
Sensitivity 0.00000 0.6767 0.6457 0.43243 0.166667
Specificity 1.00000 0.7865 0.6250 0.94526 1.000000
Pos Pred Value NaN 0.7031 0.5430 0.51613 1.000000
Neg Pred Value 0.97428 0.7650 0.7188 0.92500 0.983871
Prevalence 0.02572 0.4277 0.4084 0.11897 0.019293
Detection Rate 0.00000 0.2894 0.2637 0.05145 0.003215
Detection Prevalence 0.00000 0.4116 0.4855 0.09968 0.003215
Balanced Accuracy 0.50000 0.7316 0.6353 0.68884 0.583333
# Compute precision, recall, and F1 score from the confusion matrix
metrics <- data.frame(
class = rownames(cm$byClass),
precision = cm$byClass[, "Precision"],
recall = cm$byClass[, "Recall"],
f1 = cm$byClass[, "F1"]
)
# Sort in ascending order by class
metrics <- metrics[order(metrics$class), ]
rownames(metrics) <- NULL
metrics[is.na(metrics)] <- 0
# Print the results
print(metrics, row.names = FALSE)
class precision recall f1
Class: 4 0.0000000 0.0000000 0.0000000
Class: 5 0.7031250 0.6766917 0.6896552
Class: 6 0.5430464 0.6456693 0.5899281
Class: 7 0.5161290 0.4324324 0.4705882
Class: 8 1.0000000 0.1666667 0.2857143
# SVM
svm_pred <- predict(model, X_test, type = "class")
svm_acc <- sum(svm_pred == y_test) / length(y_test)
# Decision Tree
dt_pred <- predict(dt_model, X_test, type = "class")
dt_acc <- sum(dt_pred == y_test) / length(y_test)
# Random Forest
rf_pred <- predict(rf_model, X_test, type = "class")
rf_acc <- sum(rf_pred == y_test) / length(y_test)
# Ordinal Logistic Regression
ordinal_model <- polr(as.ordered(y_train) ~ ., data = X_train, Hess=TRUE)
ordinal_pred <- predict(ordinal_model , X_test, type = "class")
ordinal_acc <- sum(ordinal_pred == y_test) / length(y_test)
# Print results
cat("SVM Accuracy: ", svm_acc, "\n")
cat("Decision Tree Accuracy: ", dt_acc, "\n")
cat("Random Forest Accuracy: ", rf_acc, "\n")
cat("Ordinal Logistic Regression Accuracy: ", ordinal_acc, "\n")
SVM Accuracy: 0.5948553
Decision Tree Accuracy: 0.5530547
Random Forest Accuracy: 0.607717
Ordinal Logistic Regression Accuracy: 0.6109325
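To make the model comparison for the reduced-feature experiment explicit, the printed accuracies can be ranked programmatically (a Python sketch with the values hard-coded from the output above):

```python
# Rank the experiment-2 (selected features) models by test accuracy.
acc = {
    "SVM": 0.5948553,
    "Decision Tree": 0.5530547,
    "Random Forest": 0.6077170,
    "Ordinal Logistic Regression": 0.6109325,
}
best = max(acc, key=acc.get)
print(best)  # Ordinal Logistic Regression edges out Random Forest here
```

Note the contrast with the all-features experiment, where Random Forest was the most accurate model.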
# Load required package
library(pROC)
# Compute predicted probabilities for each model
svm_prob <- predict(model, X_test, probability = TRUE)
dt_prob <- predict(dt_model, X_test, type = "prob")
rf_prob <- predict(rf_model, X_test, type = "prob")
svm_probabilities <- attr(svm_prob, "probabilities")
ordinal_prob <- predict(ordinal_model, X_test, type="probs")
# Compute one-vs-rest AUROC for each model and class.
# With a binary (one-vs-rest) response, multiclass.roc reduces to a standard
# binary ROC, so the four models can share a single helper.
one_vs_rest_rocs <- function(prob_matrix) {
  lapply(levels(as.factor(y_test)), function(class) {
    if (sum(y_test == class) > 0) {
      class_prob <- prob_matrix[, class]
      response <- as.integer(y_test == class)
      multiclass.roc(response = response, predictor = class_prob)
    } else {
      NULL
    }
  })
}
svm_rocs <- one_vs_rest_rocs(svm_probabilities)
dt_rocs <- one_vs_rest_rocs(dt_prob)
rf_rocs <- one_vs_rest_rocs(rf_prob)
ordinal_rocs <- one_vs_rest_rocs(ordinal_prob)
# Print the per-class AUROC for each model
cat("[ SVM AUROC ]\n")
for (i in seq_along(svm_rocs)) {
cat("SVM AUROC for class ", levels(as.factor(y_test))[i], ": ",
auc(svm_rocs[[i]]), "\n")
}
cat("\n[ Decision Tree AUROC ]\n")
for (i in seq_along(dt_rocs)) {
cat("Decision Tree AUROC for class ", levels(as.factor(y_test))[i], ": ",
auc(dt_rocs[[i]]), "\n")
}
cat("\n[ Random Forest AUROC ]\n")
for (i in seq_along(rf_rocs)) {
cat("Random Forest AUROC for class ", levels(as.factor(y_test))[i], ": ",
auc(rf_rocs[[i]]), "\n")
}
cat("\n[ Ordinal Logistic Regression AUROC ]\n")
for (i in seq_along(ordinal_rocs)) {
cat("Ordinal Logistic Regression AUROC for class ", levels(as.factor(y_test))[i], ": ",
auc(ordinal_rocs[[i]]), "\n")
}
[ SVM AUROC ]
SVM AUROC for class 4 : 0.7838284
SVM AUROC for class 5 : 0.8073836
SVM AUROC for class 6 : 0.6907309
SVM AUROC for class 7 : 0.845433
SVM AUROC for class 8 : 0.8939891

[ Decision Tree AUROC ]
Decision Tree AUROC for class 4 : 0.7192657
Decision Tree AUROC for class 5 : 0.768628
Decision Tree AUROC for class 6 : 0.6219403
Decision Tree AUROC for class 7 : 0.8131288
Decision Tree AUROC for class 8 : 0.8237705

[ Random Forest AUROC ]
Random Forest AUROC for class 4 : 0.6984323
Random Forest AUROC for class 5 : 0.8191898
Random Forest AUROC for class 6 : 0.7095815
Random Forest AUROC for class 7 : 0.8747287
Random Forest AUROC for class 8 : 0.9076503

[ Ordinal Logistic Regression AUROC ]
Ordinal Logistic Regression AUROC for class 4 : 0.6485149
Ordinal Logistic Regression AUROC for class 5 : 0.8184929
Ordinal Logistic Regression AUROC for class 6 : 0.684697
Ordinal Logistic Regression AUROC for class 7 : 0.8562833
Ordinal Logistic Regression AUROC for class 8 : 0.8896175
AUROC values range from 0 to 1, with higher values indicating better model performance. A value of 0.5 corresponds to random guessing, and a value of 1 indicates perfect classification.
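As a small illustration of this interpretation (a toy sketch using the already loaded pROC package; the response and predictor vectors below are made up, not taken from the wine data):

```r
library(pROC)

# Toy one-vs-rest setup: 1 = positive class, 0 = everything else
response  <- c(0, 0, 1, 1)
predictor <- c(0.1, 0.4, 0.35, 0.8)

# 3 of the 4 (negative, positive) pairs are ranked correctly,
# so the AUC is 3/4 = 0.75 -- between random (0.5) and perfect (1.0)
roc_obj <- roc(response, predictor)
auc(roc_obj)  # 0.75
```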
Per-model AUROC interpretation
SVM (Support Vector Machine)
Decision Tree
Overall analysis
In this project, we compared the performance of models trained on all features against models trained on only the five features most strongly correlated with quality. With all features, random forest and SVM achieved the highest accuracy; with only the five selected features, accuracy dropped somewhat. This suggests that with more features available, each model can exploit additional information and capture more complex patterns.
There are several possible reasons why using only the five most correlated features did not markedly improve performance. First, selecting only highly correlated features can discard other important information. Second, a high correlation with the target does not guarantee high predictive power; variables with low individual correlation can still contribute to a model's predictions.
To address this, it can be useful to go beyond simple correlation when selecting variables and apply other statistical techniques, for example the variable-importance measures built into random forests, or regularization methods such as the Lasso for variable selection. Selecting important variables effectively through such methods can be expected to improve model performance.
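A minimal sketch of these two alternatives is shown below. It runs on synthetic stand-in data rather than the wine data, and it assumes the glmnet package is installed (glmnet is not loaded anywhere in this report; randomForest is).

```r
# Sketch only: feature selection beyond raw correlation, on synthetic data
library(randomForest)
library(glmnet)

set.seed(42)
n  <- 200
df <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
# Only x1 and x2 actually drive the (binary) outcome here
df$y <- as.factor(ifelse(df$x1 + 0.5 * df$x2 + rnorm(n) > 0, "high", "low"))

# 1) Random forest variable importance (mean decrease in Gini impurity):
#    larger values indicate more important predictors
rf_fit <- randomForest(y ~ ., data = df)
print(importance(rf_fit))

# 2) Lasso (alpha = 1): cv.glmnet picks lambda by cross-validation;
#    predictors with nonzero coefficients at lambda.min are the ones retained
cv_fit <- cv.glmnet(as.matrix(df[, c("x1", "x2", "x3")]), df$y,
                    family = "binomial", alpha = 1)
print(coef(cv_fit, s = "lambda.min"))
```

On the wine data, the same pattern would apply with quality as the target and the physicochemical measurements as the predictor matrix.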